Telco Customer Churn Project: Summary Paper
1 Chapter 1: Introduction of Company
1.1 Basic Information
Telco Systems is a company that has worked on the design and development of high-performance network communications for over 40 years. It is a global leader in telecommunications and provides telecommunication services such as 5G internet and network slicing. None of us in the group had analyzed a company in the telecommunications field before.
1.2 Reasons for Choosing Telco Systems
We chose this company because Telco provides home phone and internet service on a large scale, and both are unavoidable in contemporary society. People are used to carrying mobile phones when they go out, and they have adapted to a life with the internet. At home, they spend most of their time online, whether working or playing. It is fair to say that people cannot do without the internet nowadays, so the companies that provide these services are crucial. Telco is a good company, but it also has drawbacks. Not all users are satisfied with its service, so we want to know where Telco can improve. We care about what customers say about Telco and what they dislike. Our aim is to help the company find specific problems and avoid unnecessary customer churn. By analyzing the company, we want to learn which factors Telco can improve to give customers a better experience. The best way to learn about that experience is through customer surveys, so we chose a dataset based on a customer survey.
2 Chapter 2: Description of Dataset
2.1 About Dataset
The Telco customer churn data contains information about Telco, which provided home phone and Internet services to 7043 customers in California at the end of 2017 Quarter 3. The data includes customers’ basic information and indicates which customers have left, stayed, or signed up for the service.
Studying such data can help companies identify the characteristics of lost customers, identify potential, soon-to-be-lost customers and develop appropriate strategies to retain them.
The dataset is WA_Fn-UseC_-Telco-Customer-Churn.csv. Before analyzing it, we researched what churn represents and why avoiding churn matters in business. Churn in this dataset refers to lost customers. Some may wonder why a company should spend time retaining current customers or reducing the number it loses. In fact, acquiring a new customer is much harder than retaining an existing one. A company can spend less to retain existing customers than it would spend on advertising to win new ones, and retention is also a sound strategy for maintaining a good reputation. Therefore, it is crucial to identify current problems and then fix them.
2.2 Variables
## 'data.frame': 7032 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
## ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...
- gender: Female or Male
- SeniorCitizen: customer is a senior citizen or not (Yes, No)
- Partner: customer has a partner or not (Yes, No)
- Dependents: customer has dependents or not (Yes, No)
- tenure: number of months the customer has stayed with the company
- PhoneService: customer has a phone service or not (Yes, No)
- MultipleLines: customer has multiple lines or not (Yes, No, No phone service)
- InternetService: customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: customer has online security or not (Yes, No, No internet service)
- OnlineBackup: customer has online backup or not (Yes, No, No internet service)
- DeviceProtection: customer has device protection or not (Yes, No, No internet service)
- TechSupport: customer has tech support or not (Yes, No, No internet service)
- StreamingTV: customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: customer has streaming movies or not (Yes, No, No internet service)
- Contract: contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: customer has paperless billing or not (Yes, No)
- PaymentMethod: Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)
- MonthlyCharges: amount charged monthly
- TotalCharges: total amount charged
- Churn: customer churned or not (Yes or No)
For our exploratory data analysis, we did some preprocessing, including cleaning and type conversion. We dropped the rows with “NA” values to simplify the analysis, which removed 11 of the 7043 customers and left 7032 observations, and we converted all variables except customerID, tenure, MonthlyCharges, and TotalCharges into factor variables.
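The preprocessing described above can be sketched in Python as follows. This is only an illustration, not the R code behind the output shown in this chapter: the two sample rows and the helper name `clean_rows` are made up, only a few of the 21 columns are included, and the sketch assumes that a missing value appears as a blank field in the CSV file.

```python
import csv
import io

# Hypothetical two-row excerpt with only a few of the 21 columns.
# The second row has a blank TotalCharges and should be dropped.
SAMPLE = """customerID,tenure,MonthlyCharges,TotalCharges,Churn
0001-AAAAA,1,29.85,29.85,No
0002-BBBBB,0,52.55, ,No
"""

# Columns kept numeric; every other column stays a categorical string.
NUMERIC = {"tenure": int, "MonthlyCharges": float, "TotalCharges": float}


def clean_rows(text):
    """Drop rows with a missing value and convert the numeric columns."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        if any(value.strip() == "" for value in row.values()):
            continue  # assumed equivalent of dropping NA rows
        for name, convert in NUMERIC.items():
            row[name] = convert(row[name])
        cleaned.append(row)
    return cleaned


rows = clean_rows(SAMPLE)
print(len(rows), rows[0]["TotalCharges"])
```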
3 Chapter 3: Categorical Variables EDA
A brief overview of the dataset shows that 1869 customers left the telephone service company and 5163 did not. Below we refer to the customers who left as churned customers.
3.1 Churn vs Not Churn
3.1.1 What do churned customers look like?
Here we have picked out a few factors from the summary whose levels differ markedly among churned customers.

| SeniorCitizen | Partner | Dependents | PhoneService | InternetService | PaperlessBilling |
|---|---|---|---|---|---|
| No: 1393 | No: 1200 | No: 1543 | No: 170 | DSL: 459 | No: 469 |
| Yes: 476 | Yes: 669 | Yes: 326 | Yes: 1699 | Fiber optic: 1297 | Yes: 1400 |
| | | | | No: 113 | |
Most churned customers are not senior citizens (1393 of 1869), although about a quarter (476) are, so age may still play a role in leaving. Most churned customers also have no dependents, which may mean that they are older and have their own considerations when choosing a phone company. The table also shows that most churned customers have no partner and have signed up for phone service and paperless billing. However, we cannot infer a direct cause at this point, so we return to these factors later.
Among internet customers, those with Fiber optic service churned the most (1297, compared with 459 for DSL). This may be because they are not satisfied with Fiber optic, but the raw counts support that speculation only if the DSL and Fiber optic groups are comparable in size. From the summary, there are 2416 customers with DSL and 3096 with Fiber optic, so the churn rates are about 19% for DSL (459 of 2416) and about 42% for Fiber optic (1297 of 3096). The gap remains after accounting for group size, so the speculation holds.
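To compare the two services on an equal footing, the churn counts can be divided by the group totals quoted above. The short Python sketch below does this with the counts from this section; the function name `churn_rate` is ours and is only for illustration.

```python
# Counts copied from the summary in this section.
churned = {"DSL": 459, "Fiber optic": 1297}
total = {"DSL": 2416, "Fiber optic": 3096}


def churn_rate(service):
    """Share of customers with this internet service who churned."""
    return churned[service] / total[service]


for service in churned:
    print(f"{service}: {churn_rate(service):.1%}")
```

Comparing rates rather than counts removes the effect of the Fiber optic group simply being larger.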
| OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport |
|---|---|---|---|
| No: 1461 | No: 1233 | No: 1211 | No: 1446 |
| No internet service: 0 | No internet service: 0 | No internet service: 0 | No internet service: 0 |
| Yes: 295 | Yes: 523 | Yes: 545 | Yes: 310 |
Here we see a marked pattern for the four internet service add-ons: for each of them, most churned customers had not signed up.
Now let’s look at the contract and payment method.

| Contract | PaymentMethod |
|---|---|
| Month-to-month: 1655 | Bank transfer (automatic): 258 |
| One year: 166 | Credit card (automatic): 232 |
| Two year: 48 | Electronic check: 1071 |
| | Mailed check: 308 |
Most churned customers had a short-term (month-to-month) contract and paid their bills by electronic check.
3.1.2 Contrast Churned with not Churned
Since the numbers of churned and not churned customers differ, we cannot compare raw customer counts for each attribute directly. Instead, we compare percentages.
Here we see the factors that have a significant impact on customer churn or not. Just as we mentioned earlier, these factors also characterize most churned customers.
Recall that we were unsure earlier why partner, phone service, and paperless billing would affect churn. These factors also differ between churned and not churned customers, so we use tests to find out whether they are significantly associated with churn.
3.2 Chi-square Tests
We use the \(\chi^2\) test, based on a contingency table, to test whether two categorical variables are independent.
3.2.1 Test Partner
Are they independent?
- \(H_0\): churn and partner are independent.
- \(H_1\): they are not independent.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: churn_vs_partner
## X-squared = 158, df = 1, p-value <2e-16
Since p-value = 3.97e-36 < 0.05, we reject the null hypothesis: partner status is significantly associated with churn.
3.2.2 Test PhoneService
Are they independent?
- \(H_0\): churn and phone service are independent.
- \(H_1\): they are not independent.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: churn_vs_phone
## X-squared = 0.9, df = 1, p-value = 0.3
Since p-value = 0.35 > 0.05, we fail to reject the null hypothesis: we find no significant association between phone service and churn.
3.2.3 Test PaperlessBilling
Are they independent?
- \(H_0\): churn and paperless billing are independent.
- \(H_1\): they are not independent.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: churn_vs_bill
## X-squared = 257, df = 1, p-value <2e-16
Since p-value = 8.24e-58 < 0.05, we reject the null hypothesis: paperless billing is significantly associated with churn.
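For readers who want to see what the test computes, the following is a minimal pure-Python sketch of Pearson's chi-squared test with Yates' continuity correction for a 2x2 table. It is not the R call that produced the output above; the function name `chi2_yates_2x2` and the example counts are hypothetical, and the p-value formula is valid only for 1 degree of freedom.

```python
import math


def chi2_yates_2x2(table):
    """Pearson chi-squared statistic with Yates' correction for a 2x2 table.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            # Yates' correction: shrink each deviation by up to 0.5.
            diff -= min(0.5, diff)
            statistic += diff ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts (rows: churned / not churned, columns: No / Yes).
stat, p = chi2_yates_2x2([[10, 20], [30, 40]])
print(round(stat, 3), round(p, 3))
```

A small p-value from such a test indicates an association between the two variables; it does not by itself show that one causes the other.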
4 Chapter 4: Continuous Variables EDA
A brief overview of the dataset shows that there are 3 continuous variables: tenure, MonthlyCharges, and TotalCharges.
4.1 KDE plot of continuous variables
A KDE (Kernel Density Estimation) plot depicts an estimate of the probability density function of a continuous variable. It makes the distribution of a sample easy to see, and because each density is normalized, a comparison of the distributions of different samples is not affected by their sizes.
As we can see from tenure_kdeplot, customers with lower tenure are more likely to churn. From MonthlyCharges_kdeplot, customers with higher MonthlyCharges are also more likely to churn. From TotalCharges_kdeplot, churned and retained customers have very similar distributions. Taken together, the 3 KDE plots suggest that tenure may be negatively correlated with the churn rate, that MonthlyCharges may be positively correlated with it, and that TotalCharges may contribute little to it.
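To make the size-independence concrete, here is a minimal pure-Python sketch of a Gaussian kernel density estimate. The tenure values, the fixed bandwidth of 5, and the helper names are all made up for illustration; plotting libraries normally choose the bandwidth automatically. The point is that the area under each estimated density is approximately 1 whatever the sample size.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian KDE built from sample."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def area(density, lo=-50.0, hi=150.0, steps=4000):
    """Approximate the area under the density with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(density(lo + (k + 0.5) * width) for k in range(steps)) * width


# Hypothetical tenure values (months) for two groups of different size.
churned_tenure = [1, 2, 2, 5, 8, 12]
retained_tenure = [10, 22, 28, 34, 45, 50, 60, 62, 70, 72]

for name, sample in [("churned", churned_tenure), ("retained", retained_tenure)]:
    kde = gaussian_kde(sample, bandwidth=5.0)
    print(name, round(area(kde), 3))
```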
4.2 Logistic regression
Logistic regression is the appropriate regression analysis to predict a binary outcome (the dependent variable) based on a set of independent variables.
To verify numerically the conclusions we drew from the KDE plots, we fit logistic regression models that classify churn using different features.
The anova (analysis of deviance) results are shown below.
## Analysis of Deviance Table
##
## Model 1: Churn ~ tenure
## Model 2: Churn ~ tenure + MonthlyCharges
## Model 3: Churn ~ tenure + MonthlyCharges + TotalCharges
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 7030 7176
## 2 7029 6382 1 794 <2e-16 ***
## 3 7028 6376 1 6 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 2 fits significantly better than model 1 (p < 2e-16). Adding TotalCharges in model 3 gives p = 0.017, which is significant at the 5% level but not at the 1% level, so model 3 offers only a small improvement over model 2.
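The p-values in the analysis of deviance table come from comparing the drop in residual deviance between nested models with a chi-squared distribution. The Python sketch below repeats that calculation from the rounded deviances printed in the table, so the second p-value only approximates the reported 0.017. The function name `lrt_p_value` is hypothetical, and only the one-added-term case (1 degree of freedom) is handled.

```python
import math

# Residual deviances from the analysis of deviance table above (rounded).
deviance = {"model 1": 7176, "model 2": 6382, "model 3": 6376}


def lrt_p_value(dev_reduced, dev_full, df=1):
    """Drop in deviance between two nested models and its p-value.

    Only df = 1 is handled, where P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    if df != 1:
        raise ValueError("this sketch only covers one added term")
    drop = dev_reduced - dev_full
    return drop, math.erfc(math.sqrt(drop / 2))


drop_12, p_12 = lrt_p_value(deviance["model 1"], deviance["model 2"])
drop_23, p_23 = lrt_p_value(deviance["model 2"], deviance["model 3"])
print(drop_12, p_12)
print(drop_23, p_23)
```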
4.3 AUC and ROC Curve
We can use the ROC curve and AUC to compare model 2 and model 3. The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at various threshold settings, and the AUC (Area Under the Curve) summarizes it as a measure of separability, that is, how well the model distinguishes between classes. The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. Here, the higher the AUC, the better the model is at distinguishing customers who churned from customers who stayed.
We can compare the ROC curves of two models.
The two ROC curves are almost the same. The AUC of model 2 is 0.808, which means that if we randomly choose one churned customer and one retained customer, the probability that the model ranks the churned customer higher is 0.808. The AUC of model 3 is 0.809.
Therefore, TotalCharges contributes very little to the performance of the classification model. Among the 3 continuous variables, we can dismiss the influence of TotalCharges on the churn rate.
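The ranking interpretation of AUC used above can be checked directly by counting pairs. The sketch below is a plain Python illustration with made-up scores, not the routine used to produce the reported AUC values; the function name `auc_by_pairs` is hypothetical.

```python
def auc_by_pairs(churn_scores, retained_scores):
    """Fraction of (churned, retained) pairs ranked correctly; ties count half."""
    wins = 0.0
    for c in churn_scores:
        for r in retained_scores:
            if c > r:
                wins += 1.0
            elif c == r:
                wins += 0.5
    return wins / (len(churn_scores) * len(retained_scores))


# Hypothetical predicted churn probabilities.
churn_scores = [0.9, 0.8, 0.4]
retained_scores = [0.7, 0.3, 0.2]
print(auc_by_pairs(churn_scores, retained_scores))
```

In this toy example, 8 of the 9 pairs are ordered correctly. A model that ranks at random would get about 0.5, and a perfect ranking gives 1.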
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.7909 | 0.0866 | -20.7 | <0.001 |
| tenure | -0.0550 | 0.0017 | -32.5 | <0.001 |
| MonthlyCharges | 0.0329 | 0.0013 | 25.3 | <0.001 |
Among the 3 continuous variables, tenure has a negative coefficient and MonthlyCharges has a positive coefficient. This means that a customer with low tenure and high MonthlyCharges is more likely to churn. TotalCharges is not an important factor for the churn rate among the 3 continuous variables.
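As a worked example of how to read the coefficients, the Python sketch below plugs the estimates from the table above into the logistic function. The two customers are hypothetical, and the sketch assumes that the model predicts the probability of Churn = Yes, which matches the signs discussed in the text.

```python
import math

# Coefficients of model 2 from the table above.
INTERCEPT = -1.7909
B_TENURE = -0.0550
B_MONTHLY = 0.0329


def churn_probability(tenure, monthly_charges):
    """Predicted churn probability under model 2 (assumed P(Churn = Yes))."""
    log_odds = INTERCEPT + B_TENURE * tenure + B_MONTHLY * monthly_charges
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical customers: same monthly charge, different tenure.
p_short = churn_probability(tenure=2, monthly_charges=70.7)
p_long = churn_probability(tenure=62, monthly_charges=70.7)
print(round(p_short, 3), round(p_long, 3))

# Odds ratios: multiplicative change in the odds per one-unit increase.
print(round(math.exp(B_TENURE), 3), round(math.exp(B_MONTHLY), 3))
```

Each additional month of tenure multiplies the odds of churning by a factor slightly below 1, while each additional unit of MonthlyCharges multiplies them by a factor slightly above 1.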
5 Chapter 5: EDA with violin plots and scatter plots
The exploration of individual variables above shows that some of them have a significant effect on customer churn, and the correlation plot shows that these variables are interrelated. Therefore, we graph the groups of variables with relatively large correlation coefficients (i.e., absolute value greater than 0.4) one by one to explore how they affect the customer churn rate.
5.1 Simple correlations
Since most of the variables are factors, it makes more sense to check their Spearman correlations.
A larger circle means a stronger correlation. We can see that churn is negatively correlated with contract and tenure, which means that a customer who stays longer with the company or has a longer contract term is less likely to churn. A customer who signed up for the online security service and has a technical support plan is also less likely to churn. Consistent with this, contract and tech support are positively correlated, which means that customers who signed up for a technical support plan tend to have longer contract terms.
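Spearman correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank, which is why it suits ordered factor codes. The following pure-Python sketch shows the calculation; the integer encoding of Contract and Churn and the eight sample customers are hypothetical, and the helper names are ours.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical encoding: Contract as 0/1/2 and Churn as 0/1.
contract = [0, 0, 0, 1, 1, 2, 2, 0]
churn = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(spearman(contract, churn), 3))
```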
5.2 Will Total Charge influence the churn rate with other variables?
First, tenure and TotalCharges are positively correlated. The churn rate is higher when tenure is less than 20 months; at longer tenures, the churned customers are more dispersed.
Second, for MultipleLines, the groups with no multiple lines and with no phone service have higher churn rates when TotalCharges is not high.
Finally, in the month-to-month contract group, customers are also prone to churn when TotalCharges is low.
5.3 Will Monthly Charge influence the churn rate with other variables?
First, the two sample groups with the lowest MonthlyCharges had low customer churn and a large number of customers. As MonthlyCharges increased beyond that, customer churn also started to increase, concentrated mainly where TotalCharges was still low.
Second, the relationship between MultipleLines and MonthlyCharges leads to the following three conclusions.
- Customers without multiple lines service are likely to churn if their monthly charge is greater than 75 or between 30 and 50.
- For customers without phone service, we find an interesting result: the higher the monthly charge, the fewer customers churn.
- Customers with multiple lines service are likely to churn when their monthly charge is greater than approximately 70.
5.4 Will Tenure influence the churn rate with other variables?
First, the customers who have no partner and short tenure are more likely to churn.
Second, customers who have a month-to-month contract and short tenure are more likely to churn. Also, among customers with a 2-year contract, the likelihood of churn appears to increase when tenure is long.
Finally, the payment method is also a factor. The customers who use electronic check or mailed check are more likely to churn if their tenure is short.
5.5 Will Contract influence the churn rate with other variables?
Since the four services OnlineSecurity, OnlineBackup, TechSupport, and DeviceProtection and the variable Contract are all categorical with three levels each, a direct scatter plot of a service against Contract shows only nine points in the two-dimensional plane. Therefore, we first encode these categorical variables as numbers and then jitter the values (i.e., add random noise to the numeric codes so that the points spread out in the plane).
These 4 variables (OnlineSecurity, TechSupport, OnlineBackup, and DeviceProtection) probably influence churn: customers who lack any one of them are more likely to churn if they have a month-to-month contract. DeviceProtection is a partial exception: many customers with DeviceProtection and a month-to-month contract still churn, although they are fewer than the churned customers without DeviceProtection.
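The encode-then-jitter step can be sketched in Python as follows. The integer codes, the noise width of 0.3, the seed, and the three sample customers are all assumptions made for this illustration and are not taken from our plotting code.

```python
import random

# Assumed integer codes for the factor levels.
CONTRACT_CODES = {"Month-to-month": 0, "One year": 1, "Two year": 2}
SERVICE_CODES = {"No": 0, "No internet service": 1, "Yes": 2}


def jitter(code, rng, width=0.3):
    """Shift an integer code by uniform noise in [-width, width]."""
    return code + rng.uniform(-width, width)


rng = random.Random(0)  # fixed seed so the plot is reproducible

# Hypothetical customers: (Contract, TechSupport).
customers = [("Month-to-month", "No"), ("Month-to-month", "No"), ("Two year", "Yes")]
points = [
    (jitter(CONTRACT_CODES[c], rng), jitter(SERVICE_CODES[s], rng))
    for c, s in customers
]
print(points)
```

Keeping the noise width below 0.5 ensures that the clouds around neighboring category codes do not overlap, so each point can still be read as belonging to one category.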
6 Chapter 6: Conclusion
In the EDA phase we have largely explored our first SMART question, i.e., which of the above variables have a significant impact on customer churn. These analyses show that customers prone to churn have the following characteristics: month-to-month contracts, no phone service or no MultipleLines service, and no partner. These customers have a low dependency on the operator and a low cost of leaving; in economic terms, their sunk cost is low. At the same time, most of them pay by electronic check or mailed check, which suggests that they may have conservative consumption habits. Combined with the finding that customers are more likely to churn when MonthlyCharges is higher, this suggests that they may be dissatisfied with the carrier’s charges and therefore choose to leave. In addition, for customers who demand higher quality of service, the absence of any of OnlineSecurity, OnlineBackup, TechSupport, and DeviceProtection may make them less satisfied with the existing service, increasing their probability of churning to some extent.
In the process of the above analysis, we have basically completed the summary of the characteristics of churn-prone customers. In the next analysis, we will focus on quantifying the degree of influence of the above variables on customer churn (SMART question 2), as well as analyzing which users are imminent churners based on the model we constructed (SMART question 3). We plan to train some classification models to complete the third question. The final results will be presented in our final project.